Introduction

In this last lecture we’ll introduce a couple of new concepts: cross validation and feature selection.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
##   method                   from   
##   required_pkgs.model_spec parsnip
## ── Attaching packages ────────────────────────────────────── tidymodels 0.1.4 ──
## ✓ broom        0.7.9      ✓ rsample      0.1.0 
## ✓ dials        0.0.10     ✓ tune         0.1.6 
## ✓ infer        1.0.0      ✓ workflows    0.2.4 
## ✓ modeldata    0.1.1      ✓ workflowsets 0.1.0 
## ✓ parsnip      0.1.7      ✓ yardstick    0.0.8 
## ✓ recipes      0.1.17
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## x scales::discard() masks purrr::discard()
## x dplyr::filter()   masks stats::filter()
## x recipes::fixed()  masks stringr::fixed()
## x dplyr::lag()      masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(knitr)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(modelr)
## 
## Attaching package: 'modelr'
## The following objects are masked from 'package:yardstick':
## 
##     mae, mape, rmse
## The following object is masked from 'package:broom':
## 
##     bootstrap

The Data

We’re going to work with a subset of the data that only includes a few variables, which are selected below.

mv<-readRDS("mv.Rds")%>%
  filter(!is.na(budget))%>%
  mutate(log_gross=log(gross))%>%
  mutate(year=as_factor(year))%>%
  select(
    title,
    log_gross,
         budget,
         rating,
         genre,
         runtime,
         year)%>%
  drop_na()

As usual, we’ll split the data 75/25.

split_data<-initial_split(mv)

mv_train<-training(split_data)

mv_test<-testing(split_data)

Cross Validation

The essence of prediction is discovering the extent to which our models can predict outcomes for data that did not come from our sample. Often this process is temporal: we fit a model to data from one time period, then take predictors from a subsequent period and generate predictions about the future. For instance, we might use data on team performance to predict the likely winners and losers of upcoming soccer games.

This process does not have to be temporal, though. Data can be out of sample because it hadn’t yet been collected when our first data was collected, or simply because we designated part of our sample as out of sample.

The data that is used to generate our predictions is known as training data. The idea is that this is the data used to train our model, to let it know what the relationship is between our predictors and our outcome. So far, we have worked mostly with training data.

The data that is used to validate our predictions is known as testing data. With testing data, we take our trained model and see how good it is at predicting outcomes using out-of-sample data.

One very simple approach would be to split the data once into training and testing sets, as we’ve done so far: train the model on one part of the data, then test it on the other. This would tell us whether our measure of model fit (e.g. rmse, auc) is similar or different when we apply our model to out-of-sample data.

But this would only be a “one-shot” approach. It would be better to do this multiple times: cut the data into training and testing sets, fit the model to the training data, and check its predictions against the testing data. That way, we could generate a large number of rmses and see how well the model fits across many different possible out-of-sample predictions.
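The split/fit/test loop just described can be hand-rolled in a few lines. Here is a minimal sketch in base R using the built-in mtcars data (not our movie data), with an arbitrary seed and a simple two-predictor model chosen purely for illustration:

```r
## Repeat a 75/25 split 25 times, fitting on the training part and
## computing rmse on the held-out part each time
set.seed(518)
rmses <- replicate(25, {
  in_train <- sample(nrow(mtcars), size = floor(.75 * nrow(mtcars)))
  train <- mtcars[in_train, ]
  test  <- mtcars[-in_train, ]
  mod   <- lm(mpg ~ wt + hp, data = train)  # fit on training data
  preds <- predict(mod, newdata = test)     # predict out of sample
  sqrt(mean((test$mpg - preds)^2))          # rmse on testing data
})
summary(rmses)  # distribution of out-of-sample rmse across 25 splits
```

The rsample functions below do exactly this kind of resampling for us, just more conveniently and reproducibly.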

This process is called cross validation, and it involves two important decisions: first, how the data will be cut, and second, how many times the validation will run.

We’re going to cut our training dataset 75/25, and we’ll repeat that 25 times. This is just so our code will run faster; in practice we would want to do this more like 1,000 times.

Monte Carlo Resampling

mv_rs<-mc_cv(mv_train,times=25) ## More like 1000 in practice

Lasso for Feature Selection

One of the key decisions for an analyst is which variables to include. We can make these decisions using theory or our understanding of the context, but we can also rely on computational approaches. This is known as regularization, and it involves downweighting coefficients based on the contribution each predictor makes to the model. We’re going to make use of a regularization penalty known as the “lasso.” The lasso downweights variables mostly by dropping variables that are highly correlated with one another, leaving only one of the correlated variables as a contributor to the model. We set the degree to which this penalty is applied via the “penalty” argument in the model specification.
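A stylized sketch of what the penalty does: in the special case of standardized, uncorrelated predictors, the lasso solution is just the OLS coefficient “soft-thresholded” by the penalty — shrunk toward zero, and set to exactly zero when the penalty exceeds it. (This is a simplification; glmnet’s actual computation handles correlated predictors and scales the penalty differently, and the coefficients below are made up.)

```r
## Soft-thresholding: shrink each coefficient by the penalty,
## zeroing out any coefficient smaller than the penalty
soft_threshold <- function(beta_ols, penalty) {
  sign(beta_ols) * pmax(abs(beta_ols) - penalty, 0)
}

beta_ols <- c(1.2, 0.4, -0.05)  # hypothetical OLS coefficients
soft_threshold(beta_ols, penalty = .1)
## -> 1.1 0.3 0.0 : the small coefficient is dropped entirely
```

This is why, with a nonzero penalty, some coefficients in the output below come back as exactly zero.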

penalty_spec<-.1

mixture_spec<-1

lasso_fit<- 
  linear_reg(penalty=penalty_spec,
             mixture=mixture_spec) %>% 
  set_engine("glmnet")%>%
  set_mode("regression")

Define the Workflow

movie_wf<-workflow()

Add the Model

movie_wf<-movie_wf%>%
  add_model(lasso_fit)

Set Formula

In setting the recipe for this model, we’re now going to include every variable in the dataset. This is very common in these kinds of applications.

movie_formula<-as.formula("log_gross~.")

Recipe

Because we have so many predictors, we need to generalize our process for feature engineering. Instead of running steps on particular variables, we’re going to use the capabilities of tidymodels to select types of variables.

movie_rec<-recipe(movie_formula,mv)%>%
  update_role(title,new_role="id variable")%>%
  update_role(log_gross,new_role="outcome")%>%  ## specify dv
  step_log(budget)%>% 
  step_other(all_nominal_predictors(),threshold = .01)%>%
  step_dummy(all_nominal_predictors())%>%
  step_normalize(all_predictors())%>% ## Convert all to Z scores
  step_naomit(all_predictors()) ## drop missing

To see what this does, we can use the prep and bake commands.

mv_processed<-movie_rec%>%prep()%>%bake(mv_train)
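For instance, step_other lumps rare factor levels into a single “other” category before dummy coding. A minimal base-R sketch of that lumping logic, using a made-up genre vector rather than our actual data:

```r
## Levels making up less than the threshold share of rows get
## pooled into "other" (a sketch of what step_other does)
lump_rare <- function(x, threshold = .01) {
  shares <- table(x) / length(x)
  rare   <- names(shares)[shares < threshold]
  x[x %in% rare] <- "other"
  factor(x)
}

genres <- c(rep("Comedy", 60), rep("Drama", 38), rep("Western", 2))
table(lump_rare(genres, threshold = .05))
## Western (2% of rows) gets pooled into "other"
```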

Now we can add our model to the recipe.

movie_wf<-movie_wf%>%
  add_recipe(movie_rec)

Fit resamples

To fit a model to resampled data, we use fit_resamples. It’s now going to fit the model to the training data and then test the model against the testing data for all 25 of our resampled datasets.

movie_lasso_fit<-movie_wf%>%
  fit_resamples(mv_rs)

Examine resamples and fit

Once we run this model, we get a measure of model fit from every testing dataset. collect_metrics will let us see the average.

movie_lasso_fit%>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   1.25     25 0.0111  Preprocessor1_Model1
## 2 rsq     standard   0.488    25 0.00796 Preprocessor1_Model1
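Under the hood, collect_metrics is just averaging the per-resample estimates and attaching a standard error. A sketch of that arithmetic, with hypothetical rmse values standing in for our 25 resamples:

```r
## The "mean" column is the average of the per-resample estimates;
## the "std_err" column is sd / sqrt(n) (hypothetical values shown)
resample_rmse <- c(1.21, 1.27, 1.24, 1.26, 1.22)
mean(resample_rmse)                              # -> 1.24
sd(resample_rmse) / sqrt(length(resample_rmse))  # the std_err column
```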

CV results

movie_lasso_fit%>%
  unnest(.metrics)%>%
  filter(.metric=="rmse")%>%
  ggplot(aes(x=.estimate))+
  geom_density()

Finalize Workflow

movie_lasso_final <- finalize_workflow(movie_wf,
                                      select_best(movie_lasso_fit,metric="rmse")) %>%
  fit(mv_train)

Parameter Estimates

movie_lasso_final%>%
  extract_fit_parsnip()%>%
  tidy()%>%
  kable()
term estimate penalty
(Intercept) 17.9122330 0.1
budget 1.0890334 0.1
runtime 0.0603028 0.1
rating_Not.Rated -0.0879209 0.1
rating_PG 0.0000000 0.1
rating_PG.13 0.0000000 0.1
rating_R -0.1220468 0.1
rating_other -0.0688450 0.1
genre_Adventure 0.0000000 0.1
genre_Animation 0.0416245 0.1
genre_Biography -0.0016633 0.1
genre_Comedy 0.0000000 0.1
genre_Crime 0.0000000 0.1
genre_Drama -0.0312343 0.1
genre_Horror 0.0395539 0.1
genre_other 0.0000000 0.1
year_X2001 0.0000000 0.1
year_X2002 0.0000000 0.1
year_X2003 0.0000000 0.1
year_X2004 0.0000000 0.1
year_X2005 0.0000000 0.1
year_X2006 0.0000000 0.1
year_X2007 0.0000000 0.1
year_X2008 0.0000000 0.1
year_X2009 0.0000000 0.1
year_X2010 0.0000000 0.1
year_X2011 0.0000000 0.1
year_X2012 0.0000000 0.1
year_X2013 0.0000000 0.1
year_X2014 0.0000000 0.1
year_X2015 0.0000000 0.1
year_X2016 0.0000000 0.1
year_X2017 0.0000000 0.1
year_X2018 0.0000000 0.1
year_X2019 0.0018820 0.1
year_other 0.0000000 0.1

Prediction in the testing dataset

movie_lasso_final%>%last_fit(split_data)%>%
  collect_metrics()
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       1.23  Preprocessor1_Model1
## 2 rsq     standard       0.538 Preprocessor1_Model1

Model Tuning

The problem with the above is that I arbitrarily set the value of penalty to be .1. Do I know this was correct? No! What we need to do is try out a bunch of different values of the penalty, and see which one gives us the best model fit. This process has the impressive name of “hyperparameter tuning” but it could just as easily be called “trying a bunch of stuff to see what works.”

Below I’m going to give the argument tune() for the value of penalty. This will allow us to “fill in” values later.

lasso_tune_fit<- 
  linear_reg(penalty=tune(),mixture=mixture_spec)%>% 
  set_engine("glmnet")

Update Workflow

movie_wf<-movie_wf%>%
  update_model(lasso_tune_fit)

Create Grid for Model Tuning

A tuning grid is a set of candidate values for the parameters we want to tune. I use the function grid_regular and ask it to give me 10 possible values of the penalty.

lasso_grid<-grid_regular(parameters(lasso_tune_fit) ,levels=10)
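By default, grid_regular spaces the penalty values evenly on the log scale rather than the raw scale. Assuming the default dials range for the penalty of 10^-10 to 1 (which matches the values we see in the tuning results below, but is worth checking against your installed version), the base-R equivalent is:

```r
## Ten penalty values evenly spaced on the log10 scale, from 1e-10 to 1
10^seq(-10, 0, length.out = 10)
## -> 1e-10, 1.29e-09, 1.67e-08, ..., 7.74e-02, 1
```

Log spacing makes sense here because we care about the penalty’s order of magnitude, not its absolute size.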

Fit Using the Grid

movie_lasso_tune_fit <- 
  movie_wf %>%
    tune_grid(mv_rs,grid=lasso_grid)

Examine Results

Let’s take a look and see which models fit better.

movie_lasso_tune_fit%>%
  collect_metrics()%>%
  filter(.metric=="rmse")%>%
  arrange(mean)
## # A tibble: 10 × 7
##          penalty .metric .estimator  mean     n std_err .config              
##            <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1 0.0000000001  rmse    standard    1.22    25  0.0112 Preprocessor1_Model01
##  2 0.00000000129 rmse    standard    1.22    25  0.0112 Preprocessor1_Model02
##  3 0.0000000167  rmse    standard    1.22    25  0.0112 Preprocessor1_Model03
##  4 0.000000215   rmse    standard    1.22    25  0.0112 Preprocessor1_Model04
##  5 0.00000278    rmse    standard    1.22    25  0.0112 Preprocessor1_Model05
##  6 0.0000359     rmse    standard    1.22    25  0.0112 Preprocessor1_Model06
##  7 0.000464      rmse    standard    1.22    25  0.0112 Preprocessor1_Model07
##  8 0.00599       rmse    standard    1.22    25  0.0111 Preprocessor1_Model08
##  9 0.0774        rmse    standard    1.24    25  0.0110 Preprocessor1_Model09
## 10 1             rmse    standard    1.61    25  0.0138 Preprocessor1_Model10

It looks like using a very small penalty (essentially none at all) is the way to go.

Plot Results

Let’s confirm what we learned by plotting the results.

movie_lasso_tune_fit%>%
unnest(.metrics)%>%
  filter(.metric=="rmse")%>%
  mutate(tune_id=paste0("penalty=",prettyNum(penalty,digits=4))) %>%
  select(tune_id,.estimate)%>%
  rename(RMSE=.estimate)%>%
  ggplot(aes(x=RMSE,color=tune_id,fill=tune_id))+
  geom_density(alpha=.1)

What this is telling us is that big penalty values are a really bad idea, and that most of the lower penalty values fit the data just fine.

Choose best model and fit to training data

We can choose the best model and then fit it to our training dataset.

movie_final<-
  finalize_workflow(movie_wf,
                    select_best(movie_lasso_tune_fit,
                                metric="rmse"))%>%
  fit(mv_train)

Examine Parameter Estimates

movie_final%>%
  extract_fit_parsnip()%>%
  tidy()%>%
  mutate(penalty=prettyNum(penalty,digits=4))%>%
  kable()
term estimate penalty
(Intercept) 17.9326261 1e-10
budget 1.0109962 1e-10
runtime 0.2002049 1e-10
rating_PG -0.0524470 1e-10
rating_PG.13 -0.1902265 1e-10
rating_R -0.4104647 1e-10
rating_other -0.2221898 1e-10
genre_Adventure -0.0038235 1e-10
genre_Animation 0.0858639 1e-10
genre_Biography -0.0995839 1e-10
genre_Comedy 0.0000000 1e-10
genre_Crime -0.0723087 1e-10
genre_Drama -0.0897002 1e-10
genre_Horror 0.2081184 1e-10
genre_other -0.0219192 1e-10
year_X2001 0.0624647 1e-10
year_X2002 0.0746581 1e-10
year_X2003 0.0970850 1e-10
year_X2004 0.1068933 1e-10
year_X2005 0.0599426 1e-10
year_X2006 0.0657748 1e-10
year_X2007 0.1060437 1e-10
year_X2008 0.0773221 1e-10
year_X2009 0.0622520 1e-10
year_X2010 0.0513064 1e-10
year_X2011 0.1450414 1e-10
year_X2012 0.0924123 1e-10
year_X2013 0.1307665 1e-10
year_X2014 0.1522685 1e-10
year_X2015 0.1008641 1e-10
year_X2016 0.0950039 1e-10
year_X2017 0.1620189 1e-10
year_X2018 0.1430990 1e-10
year_X2019 0.1056369 1e-10
year_other 0.0513446 1e-10

This is almost identical to what we would have gotten had we just used OLS, but now we know that for certain!

Make Prediction

pred_df<-movie_final%>%
  predict(mv_test)%>%
  rename(`Predicted Gross`=.pred)%>%
  bind_cols(mv_test)%>%
  rename(`Actual Gross`=log_gross)
rmse_final<-yardstick::rmse(pred_df,truth = `Actual Gross`,estimate = `Predicted Gross`)

rmse_final
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        1.19
gg<-pred_df%>%
  ggplot(aes(y=`Actual Gross`,x=`Predicted Gross`,text=paste(title,"<br>",
                                                             `Actual Gross`,
                                                             `Predicted Gross`)))+
  geom_point(alpha=.25,size=.5)+
  scale_x_continuous(trans="log",labels=label_dollar())+
  scale_y_continuous(trans="log",labels=label_dollar()) 
  

ggplotly(gg,tooltip="text")

Or Just

movie_final%>%last_fit(split_data)%>%
  collect_metrics()
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       1.19  Preprocessor1_Model1
## 2 rsq     standard       0.560 Preprocessor1_Model1

Choose best model and fit to full data

Once we’re sure that we’re at our very final model, we can get estimates from the full dataset.

movie_final<-
  finalize_workflow(movie_wf,
                    select_best(movie_lasso_tune_fit,
                                metric="rmse"))%>%
  fit(mv) ## FULL dataset
movie_final%>%
  extract_fit_parsnip()%>%
  tidy()%>%
  mutate(penalty=prettyNum(penalty))%>%
  kable()
term estimate penalty
(Intercept) 17.9275230 1e-10
budget 1.0114691 1e-10
runtime 0.2150985 1e-10
rating_Not.Rated -0.2367444 1e-10
rating_PG -0.0825919 1e-10
rating_PG.13 -0.1971165 1e-10
rating_R -0.4269710 1e-10
rating_other -0.1045923 1e-10
genre_Adventure -0.0024900 1e-10
genre_Animation 0.0965011 1e-10
genre_Biography -0.1129967 1e-10
genre_Comedy -0.0103016 1e-10
genre_Crime -0.0611427 1e-10
genre_Drama -0.1030142 1e-10
genre_Horror 0.1926570 1e-10
genre_other -0.0085011 1e-10
year_X2001 0.0464245 1e-10
year_X2002 0.0606343 1e-10
year_X2003 0.0772238 1e-10
year_X2004 0.0935106 1e-10
year_X2005 0.0653598 1e-10
year_X2006 0.0534027 1e-10
year_X2007 0.0957125 1e-10
year_X2008 0.0835390 1e-10
year_X2009 0.0591848 1e-10
year_X2010 0.0589645 1e-10
year_X2011 0.1183523 1e-10
year_X2012 0.0847530 1e-10
year_X2013 0.1229433 1e-10
year_X2014 0.1256413 1e-10
year_X2015 0.0992271 1e-10
year_X2016 0.0905710 1e-10
year_X2017 0.1493679 1e-10
year_X2018 0.1299486 1e-10
year_X2019 0.1127220 1e-10
year_other 0.0563972 1e-10

This is what we would use for any incoming data. Let’s say the proposal is to make either a horror or an adventure movie for $10 million. We’ve also been pitched movies that will be rated R (which of course better reflects the director’s “artistic vision”) and movies that will be rated PG-13. Let’s see what our model says about these.

newdata<-data_grid(mv,
          title="Data Science 3: The Tuning",         
          budget=1e7,
          rating=c("R","PG-13"),
          genre=c("Horror","Adventure"),
          runtime=100,
          year=as_factor(2020))

movie_final%>%
  predict(newdata)%>%
  bind_cols(newdata)%>%
  mutate(low_dollar_amount=dollar(exp(.pred-rmse_final$.estimate)))%>%
  mutate(mean_dollar_amount=dollar(exp(.pred)))%>%
  mutate(hi_dollar_amount=dollar(exp(.pred+rmse_final$.estimate)))
## # A tibble: 4 × 10
##   .pred title                        budget rating genre     runtime year  low_dollar_amou…
##   <dbl> <chr>                         <dbl> <chr>  <chr>       <dbl> <fct> <chr>           
## 1  17.8 Data Science 3: The Tuning 10000000 PG-13  Adventure     100 2020  $16,952,348     
## 2  18.8 Data Science 3: The Tuning 10000000 PG-13  Horror        100 2020  $42,843,692     
## 3  17.4 Data Science 3: The Tuning 10000000 R      Adventure     100 2020  $10,732,728     
## 4  18.3 Data Science 3: The Tuning 10000000 R      Horror        100 2020  $27,124,838     
## # … with 2 more variables: mean_dollar_amount <chr>, hi_dollar_amount <chr>
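The three dollar amounts are just the log-scale prediction back-transformed with exp, bracketed by plus or minus one rmse (also on the log scale). The arithmetic, with hypothetical values close to the first row above:

```r
## Back-transforming a log-scale prediction into dollars, with a rough
## interval of +/- one rmse on the log scale (hypothetical values)
pred_log <- 17.8   # predicted log gross
rmse     <- 1.19   # rmse on the log scale
exp(pred_log - rmse)  # low estimate
exp(pred_log)         # point estimate
exp(pred_log + rmse)  # high estimate
```

Note that because the interval is symmetric on the log scale, it is quite asymmetric in dollars.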

What should we recommend as a result?